Project - Travel Package Purchase Prediction


Context:

Objective:

Data Information

The records contain each customer's personal information along with their travel details and patterns. They also capture how the customer interacted during the sales pitch and what was learned from those sales discussions.

The detailed data dictionary is given below:

Customer Details

  1. CustomerID: Unique customer ID
  2. ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
  3. Age: Age of customer
  4. TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
  5. CityTier: Tier of the city the customer lives in. City tier depends on the city's development, population, facilities, and living standards; the categories are ordered, i.e., Tier 1 > Tier 2 > Tier 3.
  6. Occupation: Occupation of customer
  7. Gender: Gender of customer
  8. NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
  9. PreferredPropertyStar: Preferred hotel property rating by customer
  10. MaritalStatus: Marital status of customer
  11. NumberOfTrips: Average number of trips in a year by customer
  12. Passport: The customer has a passport or not (0: No, 1: Yes)
  13. OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
  14. NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
  15. Designation: Designation of the customer in the current organization
  16. MonthlyIncome: Gross monthly income of the customer

Customer Interaction Data

  1. PitchSatisfactionScore: Sales pitch satisfaction score
  2. ProductPitched: Product pitched by the salesperson
  3. NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
  4. DurationOfPitch: Duration of the pitch by a salesperson to the customer

Table of Contents (TOC)

- Importing Packages
- Unwrapping Customer Information
- Data Pre-Processing & Sanity Checks
- Summary of Data Analysis
- EDA Analysis
- Customer Profiling - Based on Products
- Model Building
- Bagging Technique Models
- Boosting Technique Models
- Stacking Technique Models
- Comparison - Bagging vs Boosting vs Stacking
- Recommendations

Importing required Packages:

Click to return to TOC
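
A minimal sketch of the imports such a notebook needs is shown below; the exact list depends on the models used in later sections (seaborn and xgboost, which also appear later, are omitted here to keep the sketch self-contained).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Model-building and evaluation utilities used across the ensemble sections.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import (
    BaggingClassifier, RandomForestClassifier,
    AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, recall_score, confusion_matrix
```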



Unwrapping the Customer Information:

Click to return to TOC


Data Description: Click to return to TOC


Data Preprocessing & Sanity Checks

Click to return to TOC


Dropping the Customer ID Column

Age

Duration of Pitch

Monthly Income

Type of Contact

Number of Followups

Preferred Property Star

Number of Trips

Number of Children Visiting

Gender

Converting columns with categorical values to the category dtype
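
Pulling the sanity checks above together, a minimal sketch of the cleaning steps might look like the following. The toy frame, the median/mode imputation choices, and the stray "Fe Male" label are assumptions for illustration, not the notebook's exact code.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the real dataset (hypothetical values).
data = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Age": [35.0, np.nan, 41.0, 28.0],
    "DurationOfPitch": [10.0, 15.0, np.nan, 9.0],
    "MonthlyIncome": [20000.0, np.nan, 23000.0, 18000.0],
    "TypeofContact": ["Self Enquiry", None, "Company Invited", "Self Enquiry"],
    "Gender": ["Male", "Fe Male", "Female", "Male"],
})

# CustomerID is a unique identifier with no predictive value -> drop it.
data = data.drop(columns=["CustomerID"])

# Numeric columns: impute missing values with the median (robust to skew).
for col in ["Age", "DurationOfPitch", "MonthlyIncome"]:
    data[col] = data[col].fillna(data[col].median())

# Categorical columns: impute with the mode.
data["TypeofContact"] = data["TypeofContact"].fillna(data["TypeofContact"].mode()[0])

# Fix the stray "Fe Male" label (assumed data-entry typo), then cast to category.
data["Gender"] = data["Gender"].replace("Fe Male", "Female").astype("category")
data["TypeofContact"] = data["TypeofContact"].astype("category")

print(data.isna().sum().sum())  # 0 once imputation is done
```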


Summary of Data Analysis

Click to return to TOC

Data Structure:

Data Cleaning:

Data Description:


Common Functions


EDA Analysis - Analyzing each attribute to understand the data patterns

Click to return to TOC


Analyzing the count and percentage of Categorical attributes using a bar chart
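
A hedged sketch of such a bar-chart helper; the function name `plot_categorical` and the toy Gender data are hypothetical stand-ins for the notebook's own plotting code.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

def plot_categorical(series: pd.Series, title: str):
    """Bar chart of level counts, annotated with the percentage of each level."""
    counts = series.value_counts()
    pct = series.value_counts(normalize=True) * 100
    ax = counts.plot(kind="bar", title=title)
    for i, (n, p) in enumerate(zip(counts, pct)):
        ax.annotate(f"{n} ({p:.1f}%)", (i, n), ha="center", va="bottom")
    return counts, pct

gender = pd.Series(["Male", "Female", "Male", "Male", "Female"])
counts, pct = plot_categorical(gender, "Gender")
print(round(pct.sum(), 1))  # percentages always total 100.0
```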

Insights from Categorical Data

Click to return to TOC

Observations:


Analyzing the Numerical attributes using Histogram and Box Plots
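
One way to pair a histogram with a box plot on a shared axis is sketched below; the helper name `hist_box` and the sampled "Age" values are illustrative, not the notebook's actual code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt

def hist_box(values, name):
    """Histogram on top, box plot underneath, sharing the x-axis."""
    fig, (ax_hist, ax_box) = plt.subplots(
        2, 1, sharex=True, gridspec_kw={"height_ratios": [3, 1]}
    )
    ax_hist.hist(values, bins=20)
    ax_box.boxplot(values, vert=False)
    ax_hist.set_title(name)
    return fig

rng = np.random.default_rng(0)
fig = hist_box(rng.normal(37, 9, 500), "Age")
print(len(fig.axes))  # 2
```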

Insights from Numerical Data

Click to return to TOC

Observations:

Univariate Analysis

Click to return to TOC

Analyzing the Age of the Customers

Observations:

Analyzing the Duration of Pitch

Observations:

Analyzing the Number of Persons Visiting

Observations:

Analyzing the Number of Trips

Observations:

Analyzing the number of children visiting

Observations:

Analyzing the Monthly income of the Customers

Observations:

Bivariate Analysis

Click to return to TOC

Observations:

Click to return to TOC

Visualising each variable's association and correlation with Product Taken
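
A sketch of one way to compute that association: one-hot encode the categorical columns so everything is numeric, then correlate each feature with ProdTaken. The toy frame and its values are hypothetical.

```python
import pandas as pd
import numpy as np

# Toy frame (hypothetical values) standing in for the cleaned dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "ProdTaken": rng.integers(0, 2, 200),
    "Passport": rng.integers(0, 2, 200),
    "MonthlyIncome": rng.normal(22000, 4000, 200),
    "TypeofContact": rng.choice(["Self Enquiry", "Company Invited"], 200),
})

# One-hot encode, then sort each feature's correlation with the target
# by absolute strength of association.
encoded = pd.get_dummies(df, drop_first=True).astype(float)
assoc = (
    encoded.corr()["ProdTaken"]
    .drop("ProdTaken")
    .sort_values(key=abs, ascending=False)
)
print(assoc.index.tolist())
```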

Analyzing the Categorical attributes with Product Taken

Observation:

Click to return to TOC

Analyzing the Numerical attributes with Product Taken

Observation:

Click to return to TOC

Observation:

Multivariate Analysis - Visualising the association with Product Taken and the correlation between the other features

Click to return to TOC


Age Group vs Product Taken

Observations:

Type of Contact vs Product Taken

Observations:

City Tier vs Product Taken

Observations:

Occupation vs Product Taken

Observations:

Gender vs Product Taken

Observations:

Marital Status vs Product Taken

Observations:

City Tier vs Type of Contact vs Product Taken

Observations:

City Tier vs Gender vs Product Taken

Observations:

Number of Followups vs Income vs Product Taken

Observations:

Number of Followups vs Duration of Pitch vs Product Pitched vs Product Taken

Observations:

Income vs Designation vs Product Taken

Observations

PitchSatisfactionScore vs Duration of Pitch vs Product Pitched

Observations


Profiling of Customers - Based on Product

Click to return to TOC


Profiling of Customers who have taken the Product - Overall

Observation: Profiling of Customers - Overall

Click to return to TOC

Profiling of Customers - Basic

Click to return to TOC

Observation: Profiling of Customers - Basic Product


Profiling of Customers - Standard

Click to return to TOC

Observation: Profiling of Customers - Standard Product


Profiling of Customers - Deluxe

Click to return to TOC

Observation: Profiling of Customers - Deluxe

Profiling of Customers - Super Deluxe

Click to return to TOC

Observation: Profiling of Customers - Super Deluxe

Profiling of Customers - King

Click to return to TOC

Observation: Profiling of Customers - King


Model Building


Click to return to TOC

Model evaluation criterion:

The model can make two kinds of wrong predictions:

  1. Predicting a customer will purchase the package when in reality they would not - a loss of resources

  2. Predicting a customer will not purchase the package when in reality they would have - a loss of opportunity

Which case is more important?

How can we get more customers to purchase the package, i.e., how do we reduce False Negatives?
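
Reducing false negatives points to recall on the positive class, balanced against precision via the F1 score. A small sketch with hypothetical predictions:

```python
from sklearn.metrics import confusion_matrix, recall_score, f1_score

# Hypothetical predictions: 1 = customer purchases the package.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# False negatives (missed buyers) are the costly error here.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn)                            # 1 missed buyer
print(recall_score(y_true, y_pred))  # tp / (tp + fn) = 4/5 = 0.8
print(round(f1_score(y_true, y_pred), 3))
```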

Split Data
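
A minimal stratified split, sketched on synthetic data; the 70/30 ratio and the ~19% positive rate are assumptions for illustration, not the notebook's actual figures.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and imbalanced target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.19).astype(int)  # ~19% purchase rate

# Stratify so both splits keep the same class proportion as the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 2), round(y_test.mean(), 2))  # similar rates
```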

Model building using Bagging Technique

Click to return to TOC


Modeling using Bagging Classifier
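
A sketch of a baseline BaggingClassifier on synthetic data; the hyperparameters are illustrative, not the notebook's tuned values. Bagging fits many trees on bootstrap resamples and averages their votes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic, imbalanced stand-in for the prepared travel data.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Default base estimator is a decision tree; 100 bootstrap resamples.
bag = BaggingClassifier(n_estimators=100, random_state=1)
bag.fit(X_tr, y_tr)
score = f1_score(y_te, bag.predict(X_te))
print(0.0 <= score <= 1.0)  # True
```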

Observations:

Modeling using Random Forest

Observations:

Modeling using Decision Tree

Observations:

Modeling post Tuning Bagging Classifier

Observations:

Modeling Bagging Classifier with Weighted Decision Tree

Observations:

Modeling using Random Forest tuning

Observations:

Modeling using Tuned Decision Tree

Observations:

Comparing Bagging Models Performance Summary

Summary of Model Building - Bagging Techniques

Click to return to TOC


Model building using Boosting Technique

Click to return to TOC


Modeling using Adaboost Classifier
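
A sketch of a baseline AdaBoost fit on synthetic data (parameters illustrative). AdaBoost trains weak learners sequentially, each one reweighting the samples the previous learners misclassified.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for the prepared travel data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Sequential ensemble of weak learners (decision stumps by default).
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_tr, y_tr)
ada_f1 = f1_score(y_te, ada.predict(X_te))
print(round(ada_f1, 3))
```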

Observations:

Modeling using Gradient Boosting

Observations:

Modeling using XGBoost

Observations:

Modeling using Adaboost Classifier Hyper Tuning

Observations:

Modeling using Gradient Boosting Hyper Tuning
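
Hyper-tuning here typically means a grid search; below is a small illustrative GridSearchCV over GradientBoosting. The grid shown is hypothetical and much smaller than a real tuning grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared travel data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Tiny illustrative grid; the notebook's actual grid is not reproduced here.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid,
    scoring="f1",  # F1 chosen to match the evaluation criterion above
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```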

Observations:

Modeling using XGBoost Hyper Tuning

Observations:

Comparing Boosting Models Performance Summary

Summary of Model Building - Boosting Technique

Click to return to TOC

Model building using Stacking Technique

Click to return to TOC


Stacking Model - Base estimators(DecisionTree Tuned, Bagging Tuned, Gradient) & Final estimator(RandomForest Tuned)
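
A sketch of this combination with scikit-learn's StackingClassifier: the base estimators' out-of-fold predictions feed the final estimator. The individual hyperparameters (e.g. `max_depth=4`) are stand-ins for the tuned values, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    StackingClassifier, BaggingClassifier,
    GradientBoostingClassifier, RandomForestClassifier,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared travel data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Base estimators are fit with cross-validation; their out-of-fold
# predictions become the training input of the final estimator.
stack = StackingClassifier(
    estimators=[
        ("dtree", DecisionTreeClassifier(max_depth=4, random_state=1)),  # hypothetical tuning
        ("bag", BaggingClassifier(n_estimators=50, random_state=1)),
        ("grad", GradientBoostingClassifier(random_state=1)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=1),
    cv=3,
)
stack.fit(X, y)
print(stack.predict(X).shape)  # (400,)
```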

Observations:

Stacking Model - Base estimators(AdaBoost Tuned, Gradient Tuned, DecisionTree) & Final estimator(XGBoost Tuned)

Observations:

Stacking Model - Base estimators(Weighted Bagging, DecisionTree Tuned, AdaBoost Tuned, Random Forest Tuned) & Final estimator(XGBoost Tuned)

Observations:

Comparing Stacking Models Performance Summary

Summary of Model Building - Stacking Technique

Click to return to TOC


Comparison - Bagging vs Boosting vs Stacking

Summary comparing the various ensemble technique models

Click to return to TOC

Model Analysis:

Based on the comparison of the top models picked from each technique:

- Considering the F1 score on the training data, the default XGBoost has the highest score at 99.92%, followed by "Bagging Weighted DTree" at 99.69% and the "Stack BagT Grad DTreeT RFT" model at 99.45%. All of these models are overfitting, given the variance (roughly >25%) against their testing F1 scores.

- "XGBoost Tuned" is the next best-fitting model, with an F1 score of 93.18% on the training data and 73.02% on the testing data, a variance of 20.2%. Accuracy is 97.2%. This model does not appear to overfit and predicts better on the testing data.

- "Gradient Tuned" generalizes better still, with an F1 score of 77.7% on the training data and 59.15% on the testing data, a variance of only 18.6%. Accuracy is 92.77%. This model does not appear to overfit and predicts well on the testing data.

Important Features:

- From the "XGBoost Tuned" model analysis, features such as Passport, Designation_Exec, Maritial Status Single, Product Pitched Deluxe, Desgination SM, Product Pitched Super Deluxe and City Tier play an important part in identifying the possible customers

- From the "Gradient Tuned" model analysis, features such as Monthly Income, Passport, Age, Designation_Exec, DurationOfPitch, Status_Single, Number of Followups, City Tier and Number of Trips play an importan part in identifying the possible customers

Observation:

- Since the final results depend on the parameter grid searched with GridSearchCV, better parameter combinations may still exist that could further improve F1 performance with additional tuning.

Recommendations:

Click to return to TOC

Based on the Customer Information:

Based on the Products taken by the Customers, we found the following insights that can be leveraged as recommendations for understanding the Customers:

